III. APPLICATION IN THE ITEC PROJECT
iTEC promotes an educational practice in which students
work on small projects that include participation in events
and talks with experts, all supported by the use of
technology. The leitmotif of this new paradigm is “Resources
beyond Content”. Thus, in iTEC, educational resources go
beyond educational content, which is frequently consumed in
the form of textbooks, and include technological resources as
well as cultural events and people who are experts in some
knowledge area.
The initial assumption was that teachers would have a hard
time figuring out which events were most relevant among
such an enormous offer. This hypothesis was the main
motivation for the design, development, and subsequent
rollout of the SDE (Scenario Development Environment) [7],
[8], a software system that works as a recommender: while
preparing an educational experience, a teacher may rely on
the SDE to select the most interesting events for learners to
attend during that experience. The SDE’s recommendation
algorithm takes several factors into account, such as the
appropriateness of the event and the proximity of the event
to the school [9].
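To make the interplay of these two factors concrete, the following is a minimal sketch of one way appropriateness and proximity could be combined into a single ranking score. It is not the SDE's algorithm, which is described in [9]; the function names, the weights, the `scale_km` parameter, and the event record layout are all hypothetical choices made for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def event_score(appropriateness, distance_km,
                w_appropriateness=0.7, w_proximity=0.3, scale_km=100.0):
    """Weighted sum of appropriateness (0..1) and a proximity term (0..1].

    The weights and scale_km are hypothetical, not values from the SDE.
    """
    proximity = 1.0 / (1.0 + distance_km / scale_km)
    return w_appropriateness * appropriateness + w_proximity * proximity


def rank_events(events, school_lat, school_lon):
    """Sort event dicts (name, appropriateness, lat, lon) by descending score."""
    def score(event):
        distance = haversine_km(school_lat, school_lon,
                                event["lat"], event["lon"])
        return event_score(event["appropriateness"], distance)
    return sorted(events, key=score, reverse=True)
```

Under these assumptions, two events that are equally appropriate are ordered by their distance to the school, while a clearly more appropriate event can still outrank a closer one.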
After three years of project work and more than 2000 pilots
in schools across Europe, that starting hypothesis has proven
to be wrong. That is to say, the number of events registered in
the P&E directory is not of the scale that was expected, and
thus a recommender is not strictly necessary. The main reason
for the low number of registered events is that the registration
process is very time-consuming and error-prone. To tackle this
drawback, an enrichment module was implemented and
integrated into the SDE; it automatically extracts and
processes large amounts of data from relevant Web sites that
list applications, events, and experts in different areas of
knowledge. Figure 2 shows events that were extracted from
the Web, as displayed in the SDE user interface.
A. Sources of information
All across the Internet there are a great number of websites
that offer information on applications, events, and experts.
For applications, we extract information from three websites
that we find particularly interesting:
– Softonic⁷ is a huge repository of information on
standalone applications. From every entry in that
repository we can extract information such as its
description, the required operating system, tags, and,
very importantly, the user rating.
– Softpedia⁸ is, similarly to Softonic, a big database of
information about applications.
⁷ http://www.softonic.com
⁸ http://www.softpedia.com
– AlternativeTo⁹ is a valuable resource for extracting
semantic information about the features of an
application. For each application, AlternativeTo provides
a list of “substitutes”, that is, applications with similar
functionality that may suit a similar need. This is
extraordinarily useful for populating a database that
serves as the knowledge base of a recommender, because
the extracted relations may lead to more accurate
recommendations. As an example, AlternativeTo enables
us to extract information such as: Skype is similar to
Google Hangouts, and Firefox is similar to Chrome.
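As an illustration of how such “substitute” lists could feed a knowledge base, the sketch below stores them as a symmetric similarity index. Treating similarity as symmetric is an assumption made here for simplicity, and the function names and the input layout (application name mapped to a list of substitutes) are hypothetical; the code does not reflect the actual SDE data model.

```python
from collections import defaultdict


def build_similarity_index(substitutes):
    """Build a symmetric index from a dict: app name -> list of substitutes."""
    index = defaultdict(set)
    for app, alternatives in substitutes.items():
        for alternative in alternatives:
            if alternative == app:
                continue  # ignore self-references
            # Assumption: if A is a substitute for B, B is one for A.
            index[app].add(alternative)
            index[alternative].add(app)
    return index


def similar_to(index, app):
    """Return the applications recorded as similar to app, sorted by name."""
    return sorted(index.get(app, set()))
```

With the two relations mentioned above as input, `similar_to(index, "Google Hangouts")` returns a list containing only Skype, even though Skype was the entry from which the relation was extracted.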
Regarding experts, two websites are the most relevant ones:
– Google Scholar¹⁰ is an enormous database of researchers.
From each record we can extract very useful information,
such as the name and position of researchers. Their
location is a very important piece of information in the
context of iTEC, because experts located nearby should
gain relevance in a hypothetical recommendation. To
extract it, we use a mechanism that takes as input the
position of the researcher (a text string that usually
contains their affiliation) and their email address. The
email address, in most cases, enables us to get the IP
address of the expert’s institution, provided that the
institution hosts the mail server, which is often true. The
affiliation can be processed with NLP software¹¹ to strip
the position and retrieve the institution, which is then
geocoded.
– LinkedIn¹² is a huge social network targeted at
professionals. Unlike Google Scholar, which sorts entries
by relevance¹³, LinkedIn gives us no way to measure the
relevance of a particular professional. To overcome that
difficulty (it is neither reasonable nor efficient to
replicate all the public LinkedIn entries), we rely on
Google’s relevance calculations. For example, to find
experts on biology we search for
“biology site:www.linkedin.com” in Google. In this way,
we get a list of search results ordered by the relevance
that Google assigns to those entries.
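The two expert-related steps described above can be sketched as follows. This is a simplified illustration, not the authors' implementation: all function names are hypothetical, the affiliation parser is a naive string heuristic standing in for the NLP step, the IP lookup resolves the email domain directly (which only approximates locating the mail server), and the Google search URL pattern is assumed. The sketch only builds the query URL; it does not fetch or parse results.

```python
import socket
from urllib.parse import urlencode


def email_domain(email):
    """Return the lower-cased domain of an email address, or None if malformed."""
    local, sep, domain = email.strip().rpartition("@")
    if not sep or not local or not domain:
        return None
    return domain.lower()


def institution_ip(domain):
    """Resolve a domain to an IPv4 address (requires network access).

    Only meaningful if the institution hosts the domain itself.
    """
    try:
        return socket.gethostbyname(domain)
    except OSError:
        return None


def institution_from_position(position):
    """Naive stand-in for the NLP step: drop the job title, keep the rest."""
    for separator in (" at ", ","):
        _, found, tail = position.partition(separator)
        if found and tail.strip():
            return tail.strip()
    return position.strip()


def linkedin_query_url(topic):
    """Build a Google search URL restricted to www.linkedin.com."""
    query = f"{topic} site:www.linkedin.com"
    return "https://www.google.com/search?" + urlencode({"q": query})
```

For instance, `institution_from_position("Professor of Biology, University of Vigo")` yields the institution name, which could then be passed to a geocoder, while `institution_ip(email_domain(...))` provides a second, independent location hint that can be fed to an IP geolocation service.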
In order to get events we need to follow a brute-force
approach¹⁴. All across Europe there are a great number of
⁹ http://alternativeto.net
¹⁰ http://www.scholar.google.com
¹¹ To that end we use the Geocoder library, which is available for the Ruby
programming language.
¹² www.linkedin.com
¹³ Google Scholar measures relevance as the number of citations that a
researcher’s publications have received.
¹⁴ At the time of writing, we extracted information from the following
websites: www.spainisculture.com, www.discoveringfinland.com,
www.unesco.org, www.finnbay.com, www.openeducationeuropa.eu,
www.visitportugal.com, www.ulisboa.pt, www.uio.no (the University of Oslo),
visit-hungary.com, visitbudapest.travel, visitbrussels.be,
www.belgica-turismo.es, www.ualg.pt (University of Algarve),
noticias.up.pt (University of Porto), www.globaleventslist.elsevier.com
(worldwide conference registry).
71 Polibits (49) 2014 ISSN 1870-9044
Information Extraction in Semantic, Highly-Structured, and Semi-Structured Web Sources